Early resolving instructions

ABSTRACT

Techniques are disclosed for handling control transfer instructions in pipelined processors. Such instructions may cause the sequence of subsequent instructions to change, and thus may require subsequent instructions to be deleted from the processor's pipeline. Pre-decode means (110) are provided for at least partially decoding control transfer instructions early in the pipeline. Subsequent instructions can then be prevented from progressing through the pipeline. The mechanism required to delete unwanted instructions is thereby simplified.

The present invention relates to parallel pipelined processors, such as very long instruction word (VLIW) processors. The present invention is particularly concerned with the way in which certain control instructions are handled in such processors. Such control instructions may be instructions which, if executed, cause the sequence of subsequent instructions to change. Such instructions are referred to herein as control transfer instructions.

Modern processors use a technique known as pipelining to increase the rate at which instructions can be processed. Pipelining works by executing an instruction in several phases, with each phase being executed in a single pipeline stage. Instructions flow through successive pipeline stages, and complete execution when they reach the end of the pipeline.

Some processor architectures provide two or more parallel pipelines for processing different instructions, or different parts of an instruction, simultaneously. For example, VLIW processors use long instruction packets which may be divided into smaller instructions for simultaneous execution in different processor pipelines. Typically the address of an instruction packet is computed by one of the pipelines (the “master” pipeline), and the computed address is distributed to the other pipelines (the “slave” pipelines). Each pipeline then fetches its own instruction, and decodes and executes that instruction. Each of these operations is normally carried out in a separate pipeline stage.

The ability of the master pipeline to compute the address of an instruction relies on the fact that the next address can be predicted in advance with a fair degree of certainty. For example, if the processor is running a loop, then the address of the next instruction will, in most cases, be either the next address in memory, or the address of the first instruction in the loop. Thus the processor is able to compute the addresses of instructions and load the instructions into the pipelines in earlier pipeline stages while preceding instructions are still being decoded and executed in later pipeline stages.

A problem in the arrangement described above is that certain instructions, when decoded and executed, may cause the addresses of subsequent instructions to be different from those already computed by the processor. For example, a branch instruction, if acted on, causes the processor to jump to an address which is not the next address in memory. In such a situation, some or all of the instructions which are in earlier pipeline stages must be removed from the pipelines (“squashed”), because they may have been loaded from the incorrect address. In parallel pipelined processors, the removal of such unwanted instructions may require a large amount of logic, which may add to the chip area of the processor and potentially slow down the operating speed of the processor.

According to a first aspect of the present invention there is provided a processor comprising a plurality of parallel pipelines for performing a series of operations on instructions from an instruction packet passing through the pipelines, wherein the processor is arranged such that, in operation, an instruction in a first pipeline is at least partially decoded before an instruction from a subsequent instruction packet is fetched by a second pipeline.

By arranging for an instruction in a first pipeline to be at least partially decoded before an instruction from a subsequent instruction packet is fetched by a second pipeline, appropriate action can be taken in response to the instruction at an early stage. For example, if the instruction is a control transfer instruction, then subsequent instructions may be prevented from being loaded into the pipelines until the processor has correctly computed the address of the next instruction. This may simplify the mechanism required to delete unwanted instructions, or remove the need for such a mechanism altogether.

The second pipeline may comprise an instruction fetch stage, and the first pipeline may comprise pre-decode means for at least partially decoding the instruction, which pre-decode means may be provided either in a stage corresponding to the instruction fetch stage in the second pipeline, or in an earlier stage.

The pre-decode means may be arranged to determine whether the instruction is an instruction which, if executed, may cause a sequence of subsequent instructions to be different from that if the instruction was not executed. Such an instruction may be, for example, a control transfer instruction. The pre-decode means may be arranged to determine whether the instruction is one of a predetermined set of instructions. The predetermined set of instructions may comprise, for example, one or more of a loop instruction, a branch instruction, a return-from-VLIW-mode instruction, an exit instruction (causing the processor to exit from a loop), a subroutine call instruction and/or other instructions. The predetermined set of instructions preferably comprises instructions that may only be executed by the first pipeline.

It will be appreciated from the above that, in accordance with embodiments of the present invention, action may be taken in response to certain control transfer instructions at an earlier stage in the processor pipeline than would normally be the case. Such instructions are referred to as early resolving control transfer instructions.

The processor may further comprise stalling means for stalling the progress of subsequent instructions through the pipelines in dependence on a result of the partial decoding of the instruction. In this way, subsequent instructions can be prevented from being loaded into and/or progressing through the pipelines, for example, if the instruction is a control transfer instruction. However, the stalling means may be arranged such that a subsequent instruction remains in at least one stage (e.g. the first stage) of the first pipeline while a stall is asserted. In this case, if the control transfer instruction is not executed, then the subsequent instruction which was held in the first pipeline may still be used. If the control transfer instruction is executed, then the subsequent instruction held in the first pipeline may need to be deleted, but the mechanism required to achieve this can be implemented relatively easily.

The first pipeline may comprise an execute stage, the execute stage comprising means for determining whether an instruction in the execute stage is to be executed. By this stage, the processor may be able to compute the address of a subsequent instruction correctly. Thus the stalling means may be arranged to release a stall when it has been determined whether an instruction which caused the stall is to be executed.

The processor may further comprise means for deleting a subsequent instruction if it is determined that the instruction which caused the stall is to be executed. This may be necessary, in particular, if the stalling means is arranged such that a subsequent instruction remains in the first pipeline while a stall is asserted.

The processor may be arranged such that, in dependence on a result of the partial decoding of the instruction (e.g. if it is determined that the instruction is an early resolving control transfer instruction), an instruction from a subsequent instruction packet is prevented from being loaded into the second pipeline in an executable form. By this it may be meant, for example, that the instruction from the subsequent instruction packet is not loaded into the second pipeline at all in the next clock cycle, or that the second pipeline contains some indication that the instruction is to be ignored, such as a cache miss signal. In this way, it may not be necessary to delete any unwanted instructions from the second pipeline if a control transfer instruction is executed in the first pipeline.

The first pipeline preferably comprises a decode stage for fully decoding an instruction, the decode stage being after the stage in which the instruction is at least partially decoded, and preferably before an execute stage.

Each of the pipelines may have an instruction fetch means for fetching an instruction to be executed, and the instruction fetch means in the first pipeline may be in an earlier pipeline stage than the instruction fetch means in the second pipeline. For example, the instruction fetch means in the first pipeline may be in the same pipeline stage as the pre-decode means. In this way, an instruction may be fetched and partially decoded in the first pipeline before an instruction from the same instruction packet has been fetched in the second pipeline. Thus, if the instruction in the first pipeline is a control transfer instruction, then appropriate action (such as stalling the pipelines) may be taken before the corresponding instruction has been loaded into the second pipeline. This can allow the mechanism for deleting any unwanted instructions from the pipelines to be simpler than would otherwise be the case.

The processor may be arranged such that corresponding instructions in different pipelines may become unaligned for at least one clock cycle. This may be due to, for example, stall signals taking effect in different pipelines in different clock cycles. Such an arrangement would normally result in a complicated mechanism for deleting unwanted instructions. Thus the present invention may advantageously be used with such an arrangement, since it may simplify the mechanism required for deleting unwanted instructions.

The processor may comprise a plurality of pipeline clusters, each cluster comprising a plurality of pipelines, the first pipeline being in a first cluster and the second pipeline being in a second cluster. The processor may be, for example, a very long instruction word processor in which instructions are issued in parallel to the plurality of pipelines.

The feature that different pipelines fetch instructions in different pipeline stages is an important part of the present invention and may be provided independently. Thus, according to a second aspect of the present invention, there is provided a processor comprising a plurality of parallel pipelines for performing a series of operations on instructions from an instruction packet passing through the pipelines, each of the pipelines comprising instruction fetch means for fetching an instruction from the instruction packet, wherein the instruction fetch means in one pipeline is in a different pipeline stage from the instruction fetch means in another pipeline.

Corresponding methods are also provided. Thus, according to a third aspect of the invention, there is provided a method of operating a processor, the processor comprising a plurality of parallel pipelines for performing a series of operations on a group of instructions passing through the pipelines, the method comprising at least partially decoding an instruction in a first pipeline before an instruction from a subsequent group of instructions is fetched by a second pipeline.

Features of one aspect of the invention may be applied to any other aspect. Apparatus features may be applied to the method aspects and vice versa.

Preferred features of the present invention will now be described, purely by way of example, with reference to the accompanying drawings, in which:

FIG. 1 shows an overview of a processor embodying the present invention;

FIG. 2 is a block diagram of a master cluster in a processor embodying the invention;

FIG. 3 is a block diagram of a slave cluster in a processor embodying the invention;

FIGS. 4(a), 4(b) and 4(c) show an example of a software pipelined loop;

FIG. 5 shows the use of predicates in a software pipeline loop;

FIG. 6 shows how a predicate register may be used to produce the predicates of FIG. 5;

FIG. 7 shows various pipeline stages in a processor embodying the invention;

FIG. 8 shows an example of unaligned stale instruction packets;

FIG. 9 shows parts of a processor in accordance with an embodiment of the present invention;

FIGS. 10(a) and 10(b) show the processor of FIG. 9 with a loop instruction in the IF and X1 stages respectively; and

FIGS. 11 to 13 illustrate the operation of a processor embodying the invention using the examples of loop, branch and rv instructions respectively.

Overview of a Parallel Pipelined Processor

FIG. 1 shows an overview of a parallel pipelined processor embodying the present invention. The processor 1 comprises instruction issuing unit 10, schedule storage unit 12, first, second, third and fourth processor clusters 14, 16, 18, 20 and system bus 22 connected to random access memory (RAM) 24, and input/output devices 26. As will be explained, each of the clusters 14, 16, 18, 20 contains a number of execution units having a shared register file.

The processor 1 is designed to operate in two distinct modes. In the first mode, referred to as scalar mode, instructions are issued to just the first cluster 14, and the second to fourth clusters 16, 18, 20 do not perform any computational tasks. In the second mode, referred to as VLIW mode, instructions are issued in parallel to all of the clusters 14, 16, 18, 20, and these instructions are processed in parallel. A group of instructions issued in parallel to the various clusters in VLIW mode is referred to as a VLIW instruction packet. In practice, the processor architecture may be configured to include any number of slave clusters. Each VLIW instruction packet contains a number of instructions (including no-operation instructions) equal to the total number of clusters times the number of execution units in each cluster.

When the processor is in VLIW mode, VLIW instruction packets are passed from the schedule storage unit 12 to the instruction issuing unit 10. In this example, the VLIW instruction packets are stored in compressed form in the schedule storage unit 12. The instruction issuing unit 10 decompresses the instruction packets and stores them in a cache memory, known as the V-cache. The various constituent instructions in the instruction packets are then read out from the V-cache and fed to the clusters 14, 16, 18, 20 via the issue slots IS1, IS2, IS3, IS4 respectively. In practice, the functions of the instruction issuing unit 10 may be distributed between the various clusters 14, 16, 18, 20. Further details of the instruction issuing unit 10 may be found in United Kingdom patent application number 0012839.7 in the name of Siroyan Limited, the entire subject matter of which is incorporated herein by reference.

The master cluster 14 controls the overall operation of the processor 1. In addition, certain control instructions are always sequenced so that they will be executed in the master cluster. The block structure of the master cluster 14 is shown in FIG. 2. The master cluster comprises first and second execution units 30, 32, control transfer unit (CTU) 34, instruction register 36, I-cache 38, V-cache partition 40, code decompression unit (CDU) 42, local memory 44, data cache 46, system bus interface 48, control and status registers 50, and predicate registers (P-regs) 52.

In operation, when the processor is in scalar mode, instructions are fetched one at a time from the I-cache 38 and placed in the instruction register 36. The instructions are then executed by one of the execution units 30, 32 or the control transfer unit 34, depending on the type of instruction. If an I-cache miss occurs, a cache controller (not shown) arranges for the required cache block to be retrieved from memory.

When the processor is in VLIW mode, two instructions are fetched in parallel from the V-cache partition 40 and are placed in the instruction register 36. The V-cache partition 40 is the part of the V-cache which stores VLIW instructions which are to be executed by the master cluster 14. The two instructions in the instruction register are issued in parallel to the execution units 30, 32 and are executed simultaneously. The V-cache partitions of all clusters are managed by the code decompression unit 42. If a V-cache miss occurs, the code decompression unit 42 retrieves the required cache block, which is stored in memory in compressed form, decompresses the block, and distributes the VLIW instructions to the V-cache partitions in each cluster. An address in decompressed program space is referred to as an imaginary address. A VLIW program counter (VPC) points to the imaginary address of the current instruction packet. As well as the VLIW instructions, V-cache tags are also stored in V-cache partition 40, to enable the code decompression unit 42 to determine whether a cache miss has occurred.

FIG. 3 shows the block structure of a slave cluster 16. The slave cluster 16 comprises first and second execution units 60, 62, instruction register 64, V-cache partition 66, local memory 68, system bus interface 70, status registers 72, and predicate registers (P-regs) 74. When the processor is in VLIW mode, instruction execution is controlled by the master cluster 14, which broadcasts an address corresponding to the next instruction packet to be issued. The instructions in the instruction packet are read from the V-cache partition in each cluster, and proceed in parallel through the execution units 60, 62.

A contiguous sequence of VLIW instruction packets is referred to as a VLIW code schedule. Such a code schedule is entered whenever the processor executes a branch to VLIW mode (bv) instruction in scalar mode. The code within a VLIW schedule consists of two types of code section: linear sections and loop sections. On entry to each VLIW code schedule, the processor begins executing a linear section. This may initiate a subsequent loop section by executing a loop instruction. Loop sections iterate automatically, terminating when the number of loop iterations reaches the value defined by the loop instruction. It is also possible to force an early exit of a loop by executing an exit instruction. When the loop section terminates, a subsequent linear section is always entered. This may initiate a further loop section, or terminate the VLIW schedule (and cause a return to scalar mode) by executing a return from VLIW mode (rv) instruction.

A loop section is entered when the loop initiation instruction (loop) is executed. This sets up the loop control context and switches the processor into VLIW loop mode. The processor then executes the loop section code repeatedly, checking that the loop continuation condition still holds true prior to the beginning of each iteration (excluding the first iteration). The loop control operation involves a number of registers which are provided in the master cluster. These registers are described below.

-   LVPC: loop start VPC value. This points to the imaginary address of the first packet in the loop section. It is loaded from VPC+1 when the loop instruction is executed and is used to load the value back to VPC at the end of each loop iteration to allow VPC to return to the start of the loop.
-   VPC: VLIW program counter. This points to the imaginary address of the current packet. It is loaded from LVPC at the end of every loop iteration or is simply incremented by 1. It is also incremented by the literal from a branch instruction when the branch instruction is executed.
-   LPC: loop start PC value. This points to the start of the first compressed frame in memory in the block that contains the first packet in the loop section. It is used when refilling the V-cache.
-   PC: program counter. When in VLIW mode, this points to the start of the current compressed block in memory.
-   IC: iteration count. This register is used to count the number of loop iterations, and is decremented for each iteration of the loop. It is loaded whenever the loop instruction is executed before entering the loop section.
-   EIC: epilogue iteration count. This register is used to count the number of loop iterations during the shutdown (epilogue) phase of a software pipelined loop (see below).
-   CC: compression count. This indicates the size of the compressed block and is used for updating the value of PC.
-   LSize: loop size. This register contains the number of packets in the loop sequence. It is loaded whenever the loop instruction is executed. The loop instruction explicitly defines the number of packets in the loop section.
-   LCount: loop packet count. This register counts the number of loop packets, and is decremented with each new packet. When LCount becomes zero a new loop iteration is initiated. LCount is loaded from LSize at the beginning of a new loop iteration.
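
By way of illustration only, the interaction of the VPC, LVPC, IC, LSize and LCount registers described above may be sketched in Python as follows. The sketch assumes a loop section of at least one packet and ignores the epilogue iteration count (EIC), exit instructions and V-cache refills. The function name `packet_sequence` and its parameters are hypothetical names introduced for this example and are not part of the described processor.

```python
def packet_sequence(loop_instr_vpc, loop_size, iterations):
    """Return the VPC values issued for a loop section (epilogue not modelled).

    loop_instr_vpc: imaginary address of the packet holding the loop instruction
    loop_size:      number of packets in the loop section (LSize), assumed >= 1
    iterations:     number of loop iterations (initial value of IC)
    """
    lvpc = loop_instr_vpc + 1   # LVPC is loaded from VPC+1 when loop executes
    vpc = lvpc                  # the first loop packet follows the loop instruction
    ic = iterations             # IC is loaded when the loop instruction executes
    lcount = loop_size          # LCount is loaded from LSize
    issued = []
    while ic > 0:
        issued.append(vpc)
        lcount -= 1             # decremented with each new packet
        if lcount == 0:         # end of an iteration
            ic -= 1
            lcount = loop_size  # reloaded from LSize for the next iteration
            vpc = lvpc          # VPC returns to the start of the loop
        else:
            vpc += 1            # otherwise VPC is simply incremented by 1
    return issued
```

For example, a loop instruction at imaginary address 9 with a loop size of three packets and two iterations would, under these assumptions, issue packets 10, 11, 12, 10, 11, 12.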

The above registers are all “early modified”, that is, they are modified before the processor has committed to a change in the processor context due to the instruction. Each register has a backup register in order to be able to restore the processor to its last committed state when performing exception handling.

Typically, a linear section of VLIW code is used to set up the context for the execution of a software pipelined loop. A software pipelined loop works by executing different iterations of the same loop in different clusters in an overlapped manner. FIG. 4 shows an illustrative example of a software pipelined loop. FIG. 4(a) shows the loop prior to scheduling. The loop contains a plurality of instructions which are to be executed a number of times (seven in this example). FIG. 4(b) shows the loop scheduled into five stages, each stage containing a number of instructions. The first stage contains the instructions which are required to be executed before a subsequent iteration can be started. This stage has a length referred to as the initiation interval. The other stages are arranged to be of the same length. FIG. 4(c) shows how the various iterations of the loop schedule are sequenced in the clusters. In this example, a total of seven iterations of a loop schedule are executed, and it is assumed that seven clusters are available. Each iteration is executed in a different execution unit, with the start times of the iterations staggered by the initiation interval.

Referring to FIG. 4(c), it can be seen that the pipeline loop schedule is arranged into a prologue (startup) phase, a kernel phase and an epilogue (shutdown) phase. The prologue and epilogue phases need to be controlled in a systematic way. This can be done through use of the predicate registers 52, 74 shown in FIGS. 2 and 3. The predicate registers 52, 74 are used to guard instructions passing through the pipelines either true or false. If an instruction is guarded true then it is executed, while if an instruction is guarded false then it is not executed and it is converted into a no-operation (NOP) instruction. In order to control the prologue and epilogue phases of a software pipeline loop, all instructions in pipeline stage i are tagged with a predicate P_i. P_i is then arranged to be true whenever pipeline stage i should be enabled. FIG. 5 shows how the predicates for each software pipeline stage change during the execution of the loop.

In order to change the predicates during the execution of the loop, the predicate values are stored in a shifting register, which is a subset of one of the predicate registers, as shown in FIG. 6. A further bit in the predicate register contains a value known as the predicate seed. The shift register subset initially contains the values 00000. When a loop is to be started, a 1 is loaded into the predicate seed. This 1 is shifted into the shift register subset prior to the first iteration, so that the values stored therein become 00001. This turns on pipeline stage 1, but leaves stages 2 through 5 disabled. When the first stage of the pipeline loop has completed (i.e. after a number of cycles equal to the initiation interval), the values in the shift register are shifted to the left, so that the shift register subset contains the values 00011. This pattern continues until the shift register subset contains the values 11111. All of the software pipeline stages are then turned on, and the loop is in the kernel phase.

When a number of iterations equal to the iteration count have been executed (in this case seven), the seed predicate is then set to zero. At this point the loop enters the epilogue phase, and zeros are shifted into the shift register subset to turn off the software pipeline stages in the correct order. When all of the pipeline stages have been turned off and the shifting predicate register contains 00000 again, the loop has completed. The processor then exits the loop mode and enters the subsequent linear section.
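
The shifting of the predicate seed through the prologue, kernel and epilogue phases may be illustrated by the following minimal Python sketch. It assumes one shift per initiation interval and models only the seed bit and the shift register subset; the function name `predicate_trace` and its parameters are hypothetical and introduced here purely for illustration.

```python
def predicate_trace(stages, iterations):
    """Return the successive contents of the shift register subset as bit strings.

    stages:     number of software pipeline stages (width of the shift register)
    iterations: iteration count; the seed is 1 for this many shifts, then 0
    """
    mask = (1 << stages) - 1
    reg = 0                      # shift register subset starts as all zeros
    started = 0                  # number of iterations started so far
    trace = []
    while True:
        seed = 1 if started < iterations else 0
        reg = ((reg << 1) | seed) & mask   # shift left, seed enters at the right
        started += seed
        trace.append(format(reg, "0{}b".format(stages)))
        if seed == 0 and reg == 0:         # every stage turned off: loop complete
            break
    return trace
```

With five stages and seven iterations, as in the example above, the sketch yields 00001, 00011, 00111, 01111 and then 11111 three times (the kernel phase), followed by 11110, 11100, 11000, 10000 and finally 00000.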

At any time the loop itself can initiate an early shutdown by executing an exit instruction. When an exit instruction is executed in any cluster the effect is to clear the seed predicate in all clusters. This causes all clusters to enter the loop shutdown phase after completing the current loop iteration.

Further details on the use of predicates in software pipelined loops may be found in United Kingdom patent application number 0014432.9 in the name of Siroyan Limited, the entire subject matter of which is incorporated herein by reference.

Processors embodying the present invention are hardware pipelined in order to maximise the rate at which they process instructions. Hardware pipelining works by implementing each of a plurality of phases of instruction execution as a single pipeline stage. Instructions flow through successive pipeline stages, in a production-line fashion, with all partially-completed instructions moving one stage forward on each processor clock cycle. Each of the execution units 30, 32 in FIG. 2 and 60, 62 in FIG. 3 is arranged as a hardware pipeline having a number of pipeline stages.

FIG. 7 shows an example of the pipeline stages that may be present in the various clusters. For simplicity, a single pipeline is shown for each cluster, although it will be appreciated that two or more pipelines may be provided in each cluster. In the pipelines of FIG. 7, instructions flow through the pipelines from left to right; thus a stage which is to the left of another stage in FIG. 7 is referred to as being before, or earlier than, that stage. The various stages in the pipelines are as follows.

-   VA: VLIW address stage. The address of the next instruction is computed in this stage in the master cluster.
-   VTIA: V-cache tags and instruction address. This stage is used to propagate the address of the next instruction from the master cluster to the slave cluster. In addition, the master cluster performs a V-tag comparison to establish whether the required instruction is in the V-cache (cache hit).
-   IF: instruction fetch. The VLIW instructions are fetched from memory into the pipelines in the various clusters.
-   D: instruction decode. The instructions are decoded to determine the type of instruction and which registers are to be the source and the destination for the instruction, and literals are extracted from the instruction.
-   X1: execute 1. First execution cycle.
-   X2: execute 2. Second execution cycle.
-   X3: execute 3. Third execution cycle.
-   C: commit. The instruction result is obtained and, unless an exception has occurred, it will commit to causing a change to the processor state.

In the VLIW instruction set which is used by the present processor there are several instructions which can directly affect the sequencing of subsequent VLIW packets. These VLIW instructions are referred to as control instructions. Such control instructions are always scheduled to be processed by the master cluster. Examples of such control instructions are as follows:

-   branch: this instruction causes the program to branch to another address. If this instruction is executed, earlier instructions in the pipelines will usually need to be discarded.
-   loop: this instruction initiates a VLIW loop. If this instruction is executed, it may be necessary to discard earlier instructions from the pipelines if the loop body is less than three packets and the total number of iterations is greater than one.
-   rv (return from VLIW mode): this instruction causes the processor to change from VLIW mode to scalar mode. If this instruction is executed, earlier instructions in the pipelines need to be discarded.
-   exit: this instruction causes the program to exit early from a loop. Depending on the way in which the exit is handled, one or more earlier instructions in the pipelines may need to be discarded.

Each of the above control instructions, if executed, may cause changes to the sequencing of subsequent instruction packets. However, such an instruction will only execute if the guard predicate corresponding to that instruction is true. The state of the guard predicate is assessed when the instruction is in the X1 stage. By that stage, potentially unwanted instructions from instruction packets following the packet with the control instruction will have already been issued to the various pipelines. Thus, if such a control instruction is executed, it may be necessary to discard subsequent instructions that have already been loaded into the pipelines, and to undo any effects of those instructions. As will now be explained, discarding such unwanted instructions may be difficult for a variety of reasons.

A first difficulty in discarding any unwanted instructions arises due to the fact that corresponding instructions (i.e. instructions from the same instruction packet) in different clusters may not always be in the same pipeline stage at the same time. This may be due to, for example, the way in which stall signals are communicated in the processor. As disclosed in co-pending United Kingdom patent application number 0027294.8 in the name of Siroyan Limited, the entire contents of which are incorporated herein by reference, corresponding instructions in different pipelines may be allowed to become temporarily out of step with each other, in order to allow time for a stall signal to be distributed between pipelines. In embodiments of the present invention, a stall signal which is generated by one cluster takes effect in that cluster on the next clock edge, but does not take effect in other clusters until one clock cycle after that. This allows at least one clock cycle for the stall signal to be distributed throughout the processor. The result of this stalling mechanism is that the instructions in different pipelines may not be aligned with each other.

An example of unaligned stale packets is shown in FIG. 8. In this example it is assumed that the X2 stages in clusters 0 and 2 have both generated stall signals. These signals cause clusters 0 and 2 to stall immediately, while clusters 1 and 3 are stalled one clock cycle later. As a result, the instructions in clusters 1 and 3 advance one stage ahead of the corresponding instructions in clusters 0 and 2. If a control instruction (such as an exit instruction, as shown in FIG. 8) is acted on in the X1 stage of cluster 2, then it is necessary to discard the instructions in the VTIA, IF and D stages of clusters 0 and 2, and from the VTIA, IF, D and X1 stages of clusters 1 and 3. The logic required to delete the unwanted packets is therefore complex due to the fact that the instructions in the pipelines may not be aligned.
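
The dependence of the discard set on pipeline alignment may be illustrated by the following Python sketch, which reproduces the example of FIG. 8 as described above. It assumes that the control instruction is acted on in a cluster which generated the stall, and that every other cluster has advanced by exactly one stage. The helper names `skew_after_stall`, `stages_to_discard` and `discard_map` are hypothetical and introduced only for this illustration.

```python
# Pipeline stages in order, earliest first, as listed for FIG. 7.
STAGES = ["VA", "VTIA", "IF", "D", "X1", "X2", "X3", "C"]

def skew_after_stall(cluster, stalling_clusters):
    """Stages by which a cluster runs ahead: 0 if it generated the stall, else 1."""
    return 0 if cluster in stalling_clusters else 1

def stages_to_discard(control_stage, skew):
    """Stages holding packets that follow the control instruction's packet."""
    first = STAGES.index("VTIA")
    last = STAGES.index(control_stage) - 1 + skew
    return STAGES[first:last + 1]

def discard_map(control_stage, stalling_clusters, n_clusters):
    """Map each cluster number to the list of stages that must be discarded."""
    return {
        c: stages_to_discard(control_stage, skew_after_stall(c, stalling_clusters))
        for c in range(n_clusters)
    }
```

Calling `discard_map("X1", {0, 2}, 4)` gives the VTIA, IF and D stages for clusters 0 and 2, and the VTIA, IF, D and X1 stages for clusters 1 and 3, which shows that the deletion logic must track the alignment of every cluster separately.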

In addition to the non-alignment problem, the number of packets which are stale and need to be deleted depends on the type of control instruction. In the case of a branch instruction or a rv instruction, all subsequent packets that have already issued are unwanted. In the case of a loop instruction, the first unwanted packet can vary depending on factors such as the loop size, number of loop iterations and number of epilogue loop iterations. For example, if the loop size is one, and the number of loop iterations is greater than one, then the first subsequent packet could be retained but the second discarded. If the loop size is two then the first two subsequent packets could be retained. Alternatively, if the number of iterations is only one then all packets could be retained since the order of packet issue would remain unchanged.
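
The loop instruction cases given above may be summarised by the following Python sketch. It is a simplification which ignores epilogue loop iterations and pipeline non-alignment, and it assumes that three subsequent packets are in flight (in the VTIA, IF and D stages) when the loop instruction reaches X1. That figure is an assumption made here for aligned pipelines, and the function name `first_unwanted_packet` is hypothetical.

```python
def first_unwanted_packet(loop_size, iterations, in_flight=3):
    """Return the 0-based index of the first subsequent packet to discard, or None.

    loop_size:  number of packets in the loop body
    iterations: total number of loop iterations
    in_flight:  subsequent packets already issued (assumed 3: VTIA, IF and D)
    """
    if iterations <= 1:
        return None          # order of packet issue is unchanged: keep everything
    if loop_size >= in_flight:
        return None          # all in-flight packets belong to the first iteration
    return loop_size         # packets beyond the loop body were fetched wrongly
```

Under these assumptions a loop size of one gives index 1 (the first subsequent packet is retained and the second discarded), a loop size of two gives index 2, and a single iteration gives `None`.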

In the case of an exit instruction, the number of packets which need to be discarded depends on loop size, number of loop iterations remaining, the number of epilogue loop iterations, and the exit instruction's position relative to the end of the loop body. In addition to deciding which packets are unwanted, predicate registers in other pipelines may have to be updated, to allow individual instructions in subsequent packets which are not deleted to become guarded false. This is necessary due to the mechanism of shifting predicates during the epilogue shutdown phase of a loop. It may be necessary to create additional stall cycles while globally broadcasting the information required to update the predicate registers, since the subsequent instructions will require the updated predicate information before they can continue.

The register files to which the execution units have access may use a mechanism known as rotation, in order to allow the program to use the same register address on subsequent iterations of a loop. If an unwanted instruction in a cluster has caused a register file to rotate, then that register file must be returned to its previous (un-rotated) state. This is also made more complicated by the packet non-alignment problem, and the additional stalls required.

Early Resolving Control Transfer Instructions

In an embodiment of the present invention, the pipeline structure of the processor is modified so that certain control transfer instructions can be fetched and decoded early in the pipeline. These instructions are referred to herein as early resolving instructions. When these instructions are detected, they prevent the movement of subsequent packets through the pipelines until the guard predicate and any other data dependencies have been resolved. This avoids the need to delete specific unwanted instructions in other clusters and to correct any unwanted rotations in register files.

In an embodiment of the invention, the structure of the master cluster is modified so that the master cluster's portion of a VLIW packet can be fetched (and validated using the V-cache tags) one cycle ahead of the slave's. This can be done in parallel with the propagation of the cache index to all the slave clusters. The instruction can then be partially decoded in the master cluster in order to initiate a stall before any subsequent instructions are fetched by the slave clusters. Whenever an early resolving instruction is detected, the master cluster allows the packet containing the instruction to continue with a cache hit signal. The cache hit signal is broadcast to the slave clusters in the IF stage in parallel with the slave clusters' instruction fetches. The VTIA stage is stalled to hold off any subsequent packets and the packet containing the early resolving instruction is allowed to propagate through to the X1 stage followed by several bubbles (or no-operation instructions/V-cache miss signals) in all pipelines.

FIG. 9 shows parts of a master cluster 14 and a slave cluster 16 in accordance with an embodiment of the invention. Each of the clusters comprises a pipeline divided into a plurality of pipeline stages, namely VA, VTIA, IF, D and X1. For clarity, only one pipeline is shown in each cluster, and later stages and other slave clusters are not shown.

Master cluster 14 comprises address computation unit 100, VLIW program counter 102, tag fetching unit 104, instruction fetching unit 106, tag matching unit 108, pre-decode unit 110, hit register 112, instruction register 114, hit register 116, decode unit 118 and execute unit 120. Slave cluster 16 comprises instruction fetching unit 122, decode unit 124, hit register 126 and execute unit 128. The processor also comprises stall control unit 130 which is distributed between the various clusters.

In operation, the address of a next instruction packet is computed in the VA stage of the master cluster by the address computation unit 100. This address is provided to the VLIW program counter 102. At the same time, a V-cache index derived from the address is provided to tag fetching unit 104 and instruction fetching unit 106. The V-cache index consists of a number of bits from the address, which are used for accessing the V-cache.

The address computed by the address computation unit 100 is loaded into the VLIW program counter 102 in the VTIA stage of the master cluster. At the same time, the V-cache index is loaded into the tag fetching unit 104 and the instruction fetching unit 106. The tag fetching unit 104 and the instruction fetching unit 106 fetch respectively the cache tag and the instruction from the address in the V-cache given by the V-cache index. The tag matching unit 108 compares the address contained in the cache tag with the address stored in the VLIW program counter 102, to determine whether the required instruction packet is in the instruction cache. The tag matching unit 108 outputs a cache hit signal which indicates whether or not the required instruction is in the V-cache. Also in the VTIA stage, the instruction which was fetched by the instruction fetching unit 106 is partially decoded in pre-decode unit 110, to determine whether the instruction is one of a number of predetermined control transfer instructions. An output of the pre-decode unit 110 is fed to stall control unit 130. If the instruction is determined to be such an instruction, then the stall control unit 130 stalls stage VTIA in the master cluster and all preceding stages on the next clock cycle. Subsequent instructions are then prevented from moving through the pipeline until the stall is removed.
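The operations performed in the VTIA stage of the master cluster can be sketched in software as follows. This is an illustrative model only: the `CacheLine` type, the `vtia_stage` function, the four-bit index width, the contents of the `EARLY_RESOLVING` set and the choice to raise a stall request only on a cache hit are all assumptions, not details taken from the embodiment.

```python
from dataclasses import dataclass

# Hypothetical set of opcodes recognised by the pre-decode unit.
EARLY_RESOLVING = {"loop", "branch", "rv"}


@dataclass
class CacheLine:
    tag_address: int  # address recorded in the cache tag
    opcode: str       # master cluster's instruction, reduced to its opcode


def vtia_stage(program_counter, v_cache, index_bits=4):
    """Model tag matching and pre-decoding for one packet address.

    Returns (hit, stall_request). `v_cache` maps a V-cache index to a
    CacheLine; the index is taken from the low bits of the address.
    """
    index = program_counter & ((1 << index_bits) - 1)
    line = v_cache.get(index)
    # Tag match: the address held in the tag must equal the program counter.
    hit = line is not None and line.tag_address == program_counter
    # Pre-decode: only a partial decode is needed, enough to classify the opcode.
    stall_request = hit and line.opcode in EARLY_RESOLVING
    return hit, stall_request
```

For example, with `v_cache = {3: CacheLine(0x13, "loop")}`, the call `vtia_stage(0x13, v_cache)` reports a hit together with a stall request, whereas `vtia_stage(0x23, v_cache)` maps to the same index but fails the tag comparison and reports a miss.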

In the IF stage, the V-cache index is loaded into the instruction fetching unit 122 in the slave cluster. The instruction fetching unit 122 then fetches the instruction from the address in its V-cache partition which is given by the V-cache index. In the same stage, the instruction which was fetched by the instruction fetching unit 106 is loaded into instruction register 114. Since the master cluster has already fetched its instruction in the VTIA stage, it is not necessary for it to fetch the instruction in the IF stage, and so the instruction in the master cluster is simply held in the instruction register 114 to allow the instructions in the slave clusters to re-align with those in the master cluster. Also in the IF stage, the cache hit signal is loaded into the hit register 112. If the hit register 112 indicates that a cache miss occurred, then the V-cache is refilled by the code decompression unit (not shown in FIG. 9).

The instructions in the various pipelines are decoded in the D stage of the various clusters by the decode units 118, 124. The decode units 118, 124 determine the type of instruction and which registers are to be the source and the destination for the instruction, and extract literals from the instruction. In the X1 stage, the decoded instructions are fed to the execution units 120, 128 for execution. In this stage, the predicate registers are checked to determine whether or not the instructions are to be executed. The hit signal also progresses through the pipelines in hit registers 116, 126, to indicate whether or not the instructions are valid.

If a stall is asserted by stall control unit 130 in the VTIA stage of the master cluster, then the VTIA stage and all preceding stages in the master cluster are stalled on the next clock cycle. By then the early resolving instruction has moved to the IF stage of the master cluster. At the same time, the corresponding instruction in the slave cluster is fetched by instruction fetch unit 122. Thus, the early resolving instruction in the master cluster and the corresponding instruction in the slave cluster are allowed to progress through the pipelines, while subsequent instructions are held off by the stall signal.

In the clock cycle after the stall is asserted, the early resolving instruction progresses to the D stage, while the subsequent instruction is held in the VTIA stage due to the stall signal. The hit register 112 is reset, indicating that any instruction in register 114 is to be ignored. Any instruction which is fetched by instruction fetch unit 122 in that clock cycle must also be ignored. This is achieved by feeding the output of hit register 112 to hit register 126 in the next clock cycle.

In the next clock cycle, the early resolving instruction has moved to the X1 stage of the master cluster, and hit registers 112 and 116 are both reset, indicating that the instructions in both the IF stage and D stage of the master cluster are to be ignored. Thus a two-stage bubble appears in the pipeline between the early resolving instruction and the subsequent instruction. Any subsequent instruction which is loaded into the slave cluster also has a “miss” signal associated with it, so that a two-stage bubble also appears in the slave cluster.

If a control transfer instruction is executed in the X1 stage of the master cluster, then the subsequent instructions would normally have to be flushed from the various pipelines. However, in the present embodiment, since the master cluster is stalled by the stall control unit 130, these unwanted instructions are prevented from being loaded into the pipelines in the first place, and instead “miss” signals are fed into the respective hit registers. Therefore, in the present embodiment, it is not necessary to implement a complicated mechanism for removing the unwanted instructions from the pipelines. Once the control transfer instruction has been acted on, it is possible to remove the stall signal, since the correct addresses will be computed by the address computation unit 100 in subsequent clock cycles.

In the present embodiment, the next instruction after the early resolving instruction is held in the VTIA stage of the master pipeline, while the early resolving instruction progresses through to the X1 stage. Bubbles, or no-operation instructions, are inserted into the IF and D stages. If the early resolving instruction is not executed (for example, because it is guarded false) then the next instruction, which is in the VTIA stage, can still be used. In this case, since there are two bubbles between the early resolving instruction and the next instruction, two clock cycles will be wasted. However, if the early resolving instruction is executed, then the instruction held in the VTIA stage of the master pipeline is squashed, and the next instruction is re-loaded based on the next address calculated by the address computation unit 100. The mechanism needed to squash the instruction held in the VTIA stage can be implemented easily, because it is always the VTIA stage in the master cluster which contains the instruction to be squashed.

Although the mechanism described above results in two wasted clock cycles in cases where the control transfer instruction is not executed, this is considered acceptable, particularly in cases where the instruction is more likely to be guarded true than false. In such cases the saving made by not needing to provide the mechanism for flushing all pipelines of unwanted instructions outweighs the disadvantage of occasionally having wasted clock cycles.

The mechanism described above is illustrated in FIG. 10, using the example of a loop instruction. In FIG. 10(a) the loop instruction is in the IF stage, a cache hit signal is broadcast to the slave clusters, and a stall is applied to the VTIA stage. In FIG. 10(b), the loop instruction is in the X1 stage. Bubbles (no-operation instructions) have appeared in the IF and D stages due to the stall in the VTIA stage. The next instruction, in this case lb1 (loop body 1), is held in the VTIA stage. The guard predicate of the loop instruction is examined, and the VTIA stall is removed.

If the loop instruction is acted on, then the lb1 instruction in the VTIA stage is the first instruction of the loop body, and could potentially still be used from the VTIA stage. However, all of the loop context registers also need to be set up in order to start the loop sequencing correctly. Hence, in the present embodiment, this instruction is discarded regardless and lb1 is re-issued as the next instruction.

FIGS. 11 to 13 show examples of the stalling conditions for loop, branch and rv (return from VLIW mode) instructions respectively. In FIGS. 11 to 13, successive rows indicate the contents of the various pipeline stages in the master cluster on successive clock cycles. Referring to FIG. 11, in the first clock cycle a loop instruction is detected in the VTIA stage. In the second clock cycle the VTIA stage is stalled, so that the instruction lb1 (loop body 1) is held up in the VTIA stage. In the third clock cycle a bubble appears in the IF stage due to the stall. In the fourth clock cycle a further bubble appears, and the loop instruction is resolved in the X1 stage. In this example, the loop instruction is guarded true and the instruction is executed. The stall is then removed. In the fifth clock cycle, the lb1 instruction which was held in VTIA is squashed and the instruction is reloaded from the VA stage. As a consequence, a further bubble appears in the pipeline. In the sixth and seventh clock cycles the instructions continue their progress through the pipeline, but with three bubbles now in the pipeline.

FIG. 12 shows the stalling conditions when a branch instruction is detected. The mechanism is similar to that described with reference to FIG. 10, except that in the fifth clock cycle, the next instruction (instruction a) is squashed and the instruction at the branch target (bt) is loaded into the VTIA stage.

FIG. 13 shows the stalling conditions when an rv instruction is detected. In this case, subsequent instructions (instructions a, b and c) which are loaded into the pipeline are squashed, until the rv instruction reaches the C stage.

The above examples assume that the control transfer instruction is guarded true. If the control transfer instruction were guarded false, then the instruction held in the VTIA stage would not be squashed in the fifth clock cycle. As a consequence that instruction would be allowed to progress through the pipeline, and only two bubbles would appear in the pipeline. Thus, only two clock cycles are wasted in cases where the control transfer instruction is not executed.
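The cycle-by-cycle behaviour described with reference to FIGS. 11 and 12 can be reproduced with a short Python sketch. It is a simplified software model and not the disclosed circuitry: the `simulate` function, the `BUBBLE` marker and the use of two instruction lists (`fallthrough` and `target`) are assumptions introduced for illustration. The model covers loop and branch instructions in the master cluster only, and it assumes that the VTIA stage is held while an early resolving instruction is in the IF or D stage.

```python
# Simplified model of the master pipeline; all names are hypothetical.
BUBBLE = "--"
EARLY_RESOLVING = {"loop", "branch"}


def simulate(fallthrough, target, guarded_true, cycles):
    """Return one row per clock cycle holding [VTIA, IF, D, X1] contents.

    `fallthrough` is the sequential instruction stream, starting with the
    early resolving instruction. `target` is the stream fetched if that
    instruction is executed (guarded true).
    """
    source = iter(fallthrough)
    vtia, if_, d, x1 = next(source, BUBBLE), BUBBLE, BUBBLE, BUBBLE
    rows = [[vtia, if_, d, x1]]
    for _ in range(cycles - 1):
        # VTIA is held while the early resolving instruction is in IF or D.
        stalled = if_ in EARLY_RESOLVING or d in EARLY_RESOLVING
        # The instruction is resolved in X1; if executed, fetching is redirected.
        taken = x1 in EARLY_RESOLVING and guarded_true
        if taken:
            source = iter(target)
        if stalled:
            next_vtia, next_if = vtia, BUBBLE
        else:
            # A taken instruction squashes whatever was held in VTIA.
            next_if = BUBBLE if taken else vtia
            next_vtia = next(source, BUBBLE)
        vtia, if_, d, x1 = next_vtia, next_if, if_, d
        rows.append([vtia, if_, d, x1])
    return rows
```

Calling `simulate(["loop", "lb1", "lb2", "lb3"], ["lb1", "lb2", "lb3"], True, 7)` gives a fifth row in which lb1 has been reloaded into VTIA and the IF, D and X1 stages all hold bubbles. With `guarded_true=False`, the fifth row has lb1 in the IF stage and bubbles only in D and X1, corresponding to the two wasted clock cycles.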

The early resolving instruction decoding mechanism and the associated V-cache pipeline design result in a short and efficient pipeline for performing both instruction fetching and the broadcasting of global signals (cache index in the VTIA stage and cache hit in the IF stage). This may reduce the number of pipeline registers, resulting in a lower chip area and lower power consumption. Short branch penalties may be incurred while waiting for branch instructions to be predicated in the X1 stage of the processor, which may result in faster processing.

Although the above description relates, by way of example, to a clustered VLIW processor it will be appreciated that the present invention is applicable to any processor having at least two parallel pipelines. Thus the invention may be applied to parallel processors other than VLIW processors, and to processors not having clustered pipelines. A processor embodying the present invention may be included as a processor “core” in a highly-integrated “system-on-a-chip” (SOC) for use in multimedia applications, network routers, video mobile phones, intelligent automobiles, digital television, voice recognition, 3D games, etc.

1. A processor comprising a plurality of parallel pipelines which perform a series of operations on instructions from an instruction packet passing through the pipelines, wherein a first pipeline comprises a pre-decoder arranged to at least partially decode an instruction in the first pipeline before an instruction from a subsequent instruction packet is fetched by a second pipeline, the pre-decoder being arranged to determine whether the instruction is an instruction which, if executed, may cause a sequence of subsequent instructions to be different from that if the instruction was not executed.
 2. A processor according to claim 1, wherein the second pipeline comprises an instruction fetch stage, and the pre-decoder is provided in a stage corresponding to the instruction fetch stage in the second pipeline.
 3. A processor according to claim 1, wherein the second pipeline comprises an instruction fetch stage, and the pre-decoder is provided in an earlier stage than that corresponding to the instruction fetch stage in the second pipeline.
 4. A processor according to claim 1, wherein the pre-decoder is arranged to determine whether the instruction is one of a predetermined set of instructions.
 5. A processor according to claim 4, wherein the predetermined set of instructions comprises instructions that may only be executed by the first pipeline.
 6. A processor according to claim 1, further comprising stall control circuitry which stalls the progress of a subsequent instruction packet in dependence on an output of the pre-decoder.
 7. A processor according to claim 6, wherein the stall control circuitry is arranged such that a subsequent instruction remains in a stage of the first pipeline while a stall is asserted.
 8. A processor according to claim 6, wherein the first pipeline comprises an execute stage, the execute stage comprising circuitry which determines whether an instruction in the execute stage is to be executed and the stall control circuitry is arranged to release a stall when it has been determined whether an instruction which caused the stall is to be executed.
 9-19. Cancelled.
 20. A processor according to claim 8, further comprising circuitry which deletes a subsequent instruction if it is determined that the instruction which caused the stall is to be executed.
 21. A processor according to claim 1, wherein the processor is arranged such that, in dependence on an output of the pre-decoder, an instruction from a subsequent instruction packet is prevented from being loaded into the second pipeline in an executable form.
 22. A processor according to claim 1, wherein the first pipeline comprises a decode stage which fully decodes an instruction, the decode stage being after the stage in which the pre-decoder is provided.
 23. A processor according to claim 1, each of the pipelines having instruction fetch circuitry which fetches an instruction to be executed, wherein the instruction fetch circuitry in the first pipeline is in an earlier pipeline stage than the instruction fetch circuitry in the second pipeline.
 24. A processor according to claim 23, wherein the instruction fetch circuitry in the first pipeline is in the same pipeline stage as the pre-decoder.
 25. A processor according to claim 1, wherein the processor is arranged such that corresponding instructions in different pipelines may become unaligned for at least one clock cycle.
 26. A processor according to claim 1, the processor comprising a plurality of pipeline clusters, each cluster comprising a plurality of pipelines, the first pipeline being in a first cluster and the second pipeline being in a second cluster.
 27. A processor according to claim 1, the processor being a VLIW processor in which instructions are issued in parallel to the plurality of pipelines.
 28. A method of operating a processor, the processor comprising a plurality of parallel pipelines which perform a series of operations on a group of instructions passing through the pipelines, the method comprising the step of at least partially decoding an instruction in a first pipeline before an instruction from a subsequent group of instructions is fetched by a second pipeline, wherein the step of at least partially decoding an instruction comprises determining whether the instruction is an instruction which, if executed, may cause a sequence of subsequent instructions to be different from that if the instruction was not executed.
 29. A method according to claim 28, wherein the step of at least partially decoding an instruction comprises determining whether the instruction is one of a predetermined set of instructions.
 30. A method according to claim 29, wherein the predetermined set of instructions comprises instructions that may only be executed by the first pipeline.
 31. A method according to claim 28, further comprising the step of stalling the progress of a subsequent instruction packet in dependence on a result of the step of partially decoding the instruction.
 32. A method according to claim 31, wherein a subsequent instruction remains in a stage of the first pipeline while a stall is asserted.
 33. A method according to claim 31, further comprising the steps of: determining, in an execute stage in the first pipeline, whether an instruction in the execute stage is to be executed; and releasing a stall when it has been determined whether an instruction which caused the stall is to be executed.
 34. A method according to claim 33, further comprising the step of deleting a subsequent instruction if it is determined that the instruction which caused the stall is to be executed.
 35. A method according to claim 28, wherein an instruction from a subsequent instruction packet is prevented from being loaded into the second pipeline in an executable form in dependence on a result of the step of partially decoding the instruction.